In the eight years since we last examined the amino
acid exchanges seen in closely related proteins, the
information has doubled in quantity and comes from a much
wider variety of protein types. The matrices derived from
these data that describe the amino acid replacement
probabilities between two sequences at various evolutionary
distances are more accurate and the scoring matrix that is
derived is more sensitive in detecting distant relationships
than the one that we previously derived [3]. The method
used in this chapter is essentially the same as that
described in the Atlas, Volume 3 [4] and Volume 5.

Accepted Point Mutations 

An accepted point mutation in a protein is a replace- 
ment of one amino acid by another, accepted by natural 
selection. It is the result of two distinct processes: the 
first is the occurrence of a mutation in the portion of the 
gene template producing one amino acid of a protein; the 
second is the acceptance of the mutation by the species
as the new predominant form. To be accepted, the new
amino acid usually must function in a way similar to the
old one: chemical and physical similarities are found be- 
tween the amino acids that are observed to interchange 
frequently. 

Any complete discussion of the observed behavior of 
amino acids in the evolutionary process must consider the 
frequency of change of each amino acid to each other one 
and the propensity of each to remain unchanged. There 
are 20 × 20 = 400 possible comparisons. To collect a
useful amount of information on these, a great many
observations are necessary. The body of data used in this
study includes 1,572 changes in 71 groups of closely related
proteins appearing in the Atlas volumes through
Supplement 2.

The mutation data were accumulated from the phylo- 
genetic trees and from a few pairs of related sequences. 
The sequences of all the nodal common ancestors in each 
tree are routinely generated. Consider, for example, the 
much simplified artificial phylogenetic tree of Figure 78.



The matrix of accepted point mutations calculated from 
this tree is shown in Figure 79. We have assumed that the
likelihood of amino acid X replacing Y is the same as that
of Y replacing X, and hence 1 is entered in box YX as
well as in box XY. This assumption is reasonable, because 
this likelihood should depend on the product of the fre- 
quencies of occurrence of the two amino acids and on 
their chemical and physical similarity. As a consequence 
of this assumption, no change in amino acid frequencies 
over evolutionary distance will be detected.
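
The symmetric tally described above can be sketched in a few lines of Python. This is only an illustration: the function name `count_accepted_mutations` and the three toy branches are assumptions made for the example and are not the data of Figure 78. Each branch is an (ancestor, descendant) pair of aligned sequences, and every differing position adds 1 to both box XY and box YX.

```python
from collections import Counter


def count_accepted_mutations(branches):
    """Tally exchanges along tree branches into a symmetric count table.

    `branches` is a list of (ancestor, descendant) pairs of aligned,
    equal-length sequences. Each differing position adds 1 to both
    (x, y) and (y, x), reflecting the symmetry assumption in the text.
    """
    counts = Counter()
    for ancestor, descendant in branches:
        if len(ancestor) != len(descendant):
            raise ValueError("sequences on a branch must be aligned")
        for x, y in zip(ancestor, descendant):
            if x != y:
                counts[(x, y)] += 1
                counts[(y, x)] += 1
    return counts


# Hypothetical branches from one inferred ancestor (illustration only).
branches = [("ABGH", "ACGH"), ("ABGH", "DBGH"), ("ABGH", "ABIH")]
counts = count_accepted_mutations(branches)
```

Because only ancestor-to-descendant pairs are compared, C and D in this toy example are never counted as exchanging with one another, even though both appear among the descendants.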

By comparing observed sequences with inferred ancestral
sequences, rather than with each other, a sharper




Figure 78. Simplified phylogenetic tree. Four "observed" proteins
are shown at the top, inferred ancestors are shown at the nodes. 
Amino acid exchanges are indicated along the branches. 
Figure 79. Matrix of accepted point mutations derived from the
tree of Figure 78.






346 ATLAS OF PROTEIN SEQUENCE AND STRUCTURE 1978 



picture of the acceptable point mutations is obtained. In 
the first amino acid position of the illustration in Figure 
78, A changes to D and A changes to C, but C and D do
not directly interchange. If we had compared the observed 
sequences directly, we would have inferred the change of 
C to D also. (In practice, some of the positions in the
nodal sequences are blank [ambiguous]. For these, we
have treated the changes statistically, distributing them
among all observed alternatives.)

The total numbers of accepted point mutations
observed between closely related sequences from 34
superfamilies, grouped into 71 evolutionary trees, are shown
in Figure 80. In order to minimize the occurrence of
changes caused by successive accepted mutations at one
site, the sequences within a tree were less than 15%
different from one another and ancestral sequences were
even closer. Of the 190 possible exchanges shown in
Figure 80, 35 were never observed. These usually involved
the amino acids that occur infrequently and are not 
highly mutable and exchanges where more than one 



nucleotide of the codon must change. Of the 1,572 ex- 
changes the largest number, 83, was observed between 
Asp and Glu, two chemically very similar amino acids
with codons differing by one nucleotide. About 20% of
the interchanges, far more than one would expect for such
similar sequences, involved amino acids whose codons
differed by more than one nucleotide. Presumably, in
any one tree, changes at some of the amino acid positions 
are rejected by selection and multiple changes at the 
mutable sites are favored. Many of the changes expected 
from the mutations of one nucleotide in a codon are 
seldom observed. Presumably these mutations have 
occurred often but have been rejected by natural selection 
acting on the proteins. For example, there were no ex- 
changes between Gly and Trp. 

Mutability of Amino Acids 

A complete picture of the mutational process must 
include a consideration of the amino acids that did not 
change, as well as those that did. For this we need to
know the probability that each amino acid will change in
a given small evolutionary interval. We call this number
the "relative mutability" of the amino acid.

In order to compute the relative mutabilities of the
amino acids, we simply count the number of times that 
each amino acid has changed in an interval and the num- 
ber of times that it has occurred in the sequences and 
thus has been subject to mutation. The relative mutability 
of each amino acid is proportional to the ratio of changes 
to occurrences. Figure 81 illustrates the computation for a
simple case in which B changes relatively often, A less 
often, and D never. 
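
A minimal sketch of this counting, assuming a hypothetical pair of aligned sequences chosen so that B changes relatively often, A less often, and D never (the sequences and the function name `relative_mutability` are illustrative assumptions, not a transcription of Figure 81):

```python
from collections import Counter


def relative_mutability(seq1, seq2):
    """Ratio of changes to occurrences for each amino acid in an aligned pair."""
    changes = Counter()
    occurrences = Counter()
    for x, y in zip(seq1, seq2):
        # Every residue in both sequences has been exposed to mutation.
        occurrences[x] += 1
        occurrences[y] += 1
        if x != y:
            # A difference counts as one change for each amino acid involved.
            changes[x] += 1
            changes[y] += 1
    return {aa: changes[aa] / occurrences[aa] for aa in occurrences}


# Hypothetical aligned pair: A occurs 3 times and changes once,
# B occurs once and changes once, D occurs twice and never changes.
mutability = relative_mutability("ADA", "ADB")
```

Only the ratios between these numbers matter, since the relative mutability is defined up to a proportionality constant.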
Figure 81. Sample computation of relative mutability. The two
aligned sequences may be two experimentally observed sequences
or an observed sequence and its inferred ancestor. 



In calculating relative mutabilities from many trees, 
the information from sequences of different lengths and 
evolutionary distances is combined. Each relative
mutability is still a ratio. The numerator is the total number of
changes of this amino acid on all branches of all protein
trees considered. The denominator is the total exposure 
of the amino acid to mutation, that is, the sum for all 
branches of its local frequency of occurrence multiplied 
by the total number of mutations per 100 links for that 
branch. 
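
As a compact restatement of this ratio, and only as a sketch, write c_j^(b) for the number of changes of amino acid j on branch b, f_j^(b) for its local frequency of occurrence on that branch, and n^(b) for the number of mutations per 100 links on that branch. These symbols are introduced here for illustration and do not appear in the original text:

$$m_j \propto \frac{\sum_b c_j^{(b)}}{\sum_b f_j^{(b)} \, n^{(b)}}$$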

The relative mutabilities of the amino acids are shown 
in Table 21. On the average, Asn, Ser, Asp, and Glu are
most mutable and Trp and Cys are least mutable.

The immutability of cysteine is understandable. 
Cysteine is known to have several unique, indispensable
functions. It is the attachment site of heme groups in
cytochrome and of FeS clusters in ferredoxin. It forms
crosslinks in other proteins such as chymotrypsin or
ribonuclease. It seldom occurs without having an
important function.

The substitution of one of the larger amino acids of 
distinctive shape and chemistry for any other is rather 
uncommon. At the other extreme, the low mutability of 
glycine must be due to its unique smallness that is
advantageous in many places. Even though serine sometimes
functions in the active center, it much more often
performs a function of lesser importance, easily mimicked by
several other amino acids of similar physical and chemical
properties. On the average it is highly mutable.

Table 21

Relative Mutabilities of the Amino Acids

Amino Acid Frequencies in the Mutation Data 

The relative frequencies of exposure to mutation of the 
amino acids are shown in Table 22. These frequencies, f_i,
are approximately proportional to the average composition
of each group multiplied by the number of mutations
in the tree. The sum of the frequencies is 1.

Mutation Probability Matrix for the 
Evolutionary Distance of One PAM 

We can combine information about the individual 
kinds of mutations and about the relative mutability of 
the amino acids into one distance-dependent "mutation
probability matrix" (see Figure 82). An element of this
matrix, M_ij, gives the probability that the amino acid in
column j will be replaced by the amino acid in row i after
a given evolutionary interval, in this case 1 PAM.



Table 22 



Normalized Frequencies of the Amino Acids 
in the Accepted Point Mutation Data 
The nondiagonal elements have the values:

$$M_{ij} = \frac{\lambda \, m_j A_{ij}}{\sum_i A_{ij}}$$

where

A_ij is an element of the accepted point mutation matrix
of Figure 80,

λ is a proportionality constant, and

m_j is the mutability of the jth amino acid, Table 21.

The diagonal elements have the values:

$$M_{jj} = 1 - \lambda \, m_j$$

Consider a typical column, that for alanine. The total 
probability, the sum of all the elements, must be 1. The
probability of observing a change in a site containing
alanine (the sum of all the elements except M_AA) is
proportional to the mutability of alanine. The same
proportionality constant, λ, holds for all columns. The
individual nondiagonal terms within each column bear the
same ratio to each other as do the observed mutations in 
the matrix of Figure 80. 

The quantity 100 × Σ_i f_i M_ii gives the number of amino
acids that will remain unchanged when a protein 100 links
long, of average composition, is exposed to the
evolutionary change represented by this matrix. This apparent
evolutionary change depends upon the choice of the
proportionality constant λ, in this case chosen so that this
change is 1 mutation. Since there are almost no superimposed
changes, this also represents 1 PAM of change. If λ had been
four times as large, the initial matrix would have
represented 4 PAMs; the discussion which follows would not
be changed noticeably.
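
The construction can be sketched for a toy three-letter alphabet. Everything numeric below (the counts `A`, mutabilities `m`, and frequencies `f`) is hypothetical, and the function name is an assumption. The sketch only shows how the proportionality constant is fixed so that a protein of average composition keeps 99 of 100 residues unchanged, with the column normalization taken over the off-diagonal counts.

```python
def mutation_probability_matrix(A, m, f, change=0.01):
    """Build M, where M[i][j] is the probability that j is replaced by i.

    A: symmetric table of accepted point mutation counts (diagonal unused)
    m: relative mutability of each amino acid
    f: normalized frequency of each amino acid (sums to 1)
    change: expected fraction of residues changed (0.01 for 1 PAM)
    """
    n = len(m)
    # Proportionality constant chosen so that sum_j f[j] * lam * m[j] == change.
    lam = change / sum(f[j] * m[j] for j in range(n))
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        column_total = sum(A[i][j] for i in range(n) if i != j)
        for i in range(n):
            if i == j:
                M[i][j] = 1.0 - lam * m[j]
            else:
                M[i][j] = lam * m[j] * A[i][j] / column_total
    return M


# Hypothetical data for three amino acids (illustration only).
A = [[0, 30, 10], [30, 0, 20], [10, 20, 0]]
m = [1.0, 0.5, 0.2]
f = [0.5, 0.3, 0.2]
M = mutation_probability_matrix(A, m, f)

# Residues expected to remain unchanged in a protein 100 links long.
unchanged = 100 * sum(f[j] * M[j][j] for j in range(3))
```

Each column of `M` sums to 1, and `unchanged` comes out at 99 for this choice of the constant.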
Figure 82. Mutation probability matrix for the evolutionary
distance of 1 PAM. An element of this matrix, M_ij, gives the
probability that the amino acid in column j will be replaced by the
amino acid in row i after a given evolutionary interval, in this case
1 accepted point mutation per 100 amino acids. Thus, there is a
0.56% probability that Asp will be replaced by Glu. To simplify
the appearance, the elements are shown multiplied by 10,000.
Simulation of the Mutational Process

For evaluating statistical methods of detecting
relationships, for developing methods of measuring evolutionary
distances between proteins, and for determining the 
accuracy of programs to construct evolutionary trees, we 
need to have examples of proteins at known evolutionary 
distances. The mutation probability matrix provides the 
information with which to simulate any amount of
evolutionary change in an unlimited number of proteins.
Further, we can start with one protein and simulate its 
separate evolution in duplicated genes or in divergent 
organisms. By considering many groups of sequences 
related by the same evolutionary history, a measure is 
readily obtained of the expected deviations due to ran- 
dom fluctuations in the evolutionary process. 

If we only require that, on the average, one mutation 
takes place in the evolutionary interval of 1 PAM, we can
use a simulation requiring one random number for each 
amino acid in the sequence, as follows: To determine the 
fate of the first amino acid, say Ala, a uniformly distrib- 
uted random number between 0 and 1 is obtained. The 
first column of the mutation probability matrix (Figure
82) gives the relative probability of each possible event
that may befall Ala (neglecting deletion for simplicity).
If the random number falls between 0 and .9867, Ala is
left unchanged. If the number is between .9867 and
.9868, it is replaced with Arg; if it is between .9868 and
.9872, it is replaced with Asp, and so forth. Similarly, a
random number is produced for each amino acid in the 
sequence, and action is taken as dictated by the corre- 
sponding column of the matrix. The result is a simulated 
mutant sequence. Any number of these can be generated; 
their average distance from the original is 1 PAM, although
some may have no mutations and some may have two or 
more. The effects on the sequence of a longer period of 
evolution may be simulated by successive applications of 
the matrix to the sequence resulting from the last applica- 
tion. 
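
The one-random-number-per-residue procedure can be sketched as follows. The three-letter alphabet and the column probabilities in `COLUMNS` are hypothetical stand-ins for the columns of Figure 82, and the function names are assumptions. Within each column the unchanged outcome is listed first, matching the order of intervals described above.

```python
import random

# Hypothetical matrix columns: for each original residue, the possible
# outcomes and their probabilities (each column sums to 1).
COLUMNS = {
    "A": [("A", 0.990), ("B", 0.006), ("D", 0.004)],
    "B": [("B", 0.980), ("A", 0.012), ("D", 0.008)],
    "D": [("D", 0.995), ("A", 0.002), ("B", 0.003)],
}


def mutate_residue(residue, u):
    """Map a uniform random number u in [0, 1) to the fate of one residue."""
    cumulative = 0.0
    for outcome, probability in COLUMNS[residue]:
        cumulative += probability
        if u < cumulative:
            return outcome
    return residue  # guards against rounding when u is very close to 1


def mutate_sequence(sequence, rng):
    """Apply the matrix once, drawing one random number per residue."""
    return "".join(mutate_residue(r, rng.random()) for r in sequence)


def evolve(sequence, steps, rng):
    """Simulate a longer period by successive applications of the matrix."""
    for _ in range(steps):
        sequence = mutate_sequence(sequence, rng)
    return sequence


mutant = evolve("ABDABDABD", 5, random.Random(0))
```

Passing a seeded `random.Random` instance keeps a simulated family of sequences reproducible.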

For simulations in which a predetermined number of 
changes are required, a two-step process involving two 
random numbers for each mutation can be used. Starting 
with a given sequence, the first amino acid that will
mutate is selected: the probability that any one will be
selected is proportional to its mutability (Table 21). Then
the amino acid that replaces it is chosen. The probability
for each replacement is proportional to the elements in
the appropriate column of Figure 82. Starting with the
resultant sequence, a second mutation can be simulated, 
and so on, until a predetermined number of changes have 
been made. In this process, superimposed and back
mutations may occur.
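
A sketch of the two-step procedure, again with hypothetical numbers: `MUTABILITY` stands in for Table 21 and `REPLACEMENT_WEIGHTS` for the nondiagonal elements of the columns of Figure 82. The names are assumptions made for this illustration.

```python
import random

# Hypothetical relative mutabilities and off-diagonal column weights.
MUTABILITY = {"A": 1.0, "B": 2.0, "D": 0.5}
REPLACEMENT_WEIGHTS = {
    "A": {"B": 0.006, "D": 0.004},
    "B": {"A": 0.012, "D": 0.008},
    "D": {"A": 0.002, "B": 0.003},
}


def simulate_fixed_changes(sequence, n_changes, rng):
    """Apply exactly n_changes mutations, two random choices per mutation."""
    residues = list(sequence)
    for _ in range(n_changes):
        # Step 1: choose the site, weighted by the mutability of its residue.
        weights = [MUTABILITY[r] for r in residues]
        site = rng.choices(range(len(residues)), weights=weights, k=1)[0]
        # Step 2: choose the replacement from that residue's column.
        options = REPLACEMENT_WEIGHTS[residues[site]]
        residues[site] = rng.choices(
            list(options), weights=list(options.values()), k=1
        )[0]
    return "".join(residues)


one_step = simulate_fixed_changes("AAAA", 1, random.Random(2))
many_steps = simulate_fixed_changes("ABDABD", 10, random.Random(1))
```

Because later mutations may hit an already mutated site, the number of observed differences after several steps can be smaller than the number of mutations applied.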



The 1 PAM matrix can be multiplied by itself N times 
to yield a matrix that predicts the amino acid
replacements to be found after N PAMs of evolutionary change
in a sequence of average composition. On the average, 
the results of the simulations above match the predictions 
of the corresponding matrices.

Mutation Probability Matrices for 
Other Distances 

The mutation probability matrix, corresponding to
1 PAM, has a number of interesting properties (see Figure
82). If, in a simulation, it is applied to a protein with the
average amino acid composition given in Table 22, on the 
average, the composition of the resulting mutated proteins 
will be unchanged. Repeated applications of the matrix 
to proteins of any other composition will give mutants 
that change toward average composition; any such matrix 
has implicit in it some particular asymptotic composition.

There is a different mutation probability matrix for 
each evolutionary interval. These can be derived from the 
one for 1 PAM by matrix multiplication. If the 1-PAM
matrix is multiplied by itself an infinite number of times, 
each column of the resulting matrix approaches the 
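asymptotic amino acid composition:

$$M^{\infty} = \begin{pmatrix} f_A & f_A & f_A & f_A & \cdots \\ f_R & f_R & f_R & f_R & \cdots \\ \vdots & \vdots & \vdots & \vdots & \end{pmatrix}$$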



At a great distance, there is very little relationship infor- 
mation left in the matrix. For example, at a distance of 
2,034 PAMs, all of the matrix values are within 5% of 
their limiting values except for the Trp-Trp element, 
which is 75% higher than the limit, and the Cys-Cys ele- 
ment, which is 11% higher. 
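
The approach to the asymptotic composition can be checked numerically on a small example. The 3 × 3 matrix `M` and the composition `f` below are hypothetical, constructed so that `f` is left unchanged by `M`. The helper names are assumptions, and repeated squaring is used only as a convenient way to reach a high power.

```python
def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [
        [sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def matrix_power(M, N):
    """Raise M to the Nth power by repeated squaring; N = 0 gives the identity."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = M
    while N > 0:
        if N & 1:
            result = matmul(result, base)
        base = matmul(base, base)
        N >>= 1
    return result


# Hypothetical matrix: M[i][j] is the probability that j is replaced by i.
M = [
    [0.992, 0.010, 0.005],
    [0.006, 0.986, 0.006],
    [0.002, 0.004, 0.989],
]
f = [0.5, 0.3, 0.2]  # composition left unchanged by M

identity = matrix_power(M, 0)
distant = matrix_power(M, 4096)
```

In this toy case every column of `distant` agrees with `f` to many decimal places, and `identity` is the unit diagonal that corresponds to zero evolutionary distance.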

The matrix for 0 PAMs is simply a unit diagonal; no 
amino acid would have changed: 

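$$M^{0} = \begin{pmatrix} 1 & 0 & 0 & \cdots \\ 0 & 1 & 0 & \cdots \\ 0 & 0 & 1 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$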



The mutation probability matrix for 250 PAMs is 
shown in Figure 83. At this evolutionary distance, only
one amino acid in five remains unchanged. However, the
amino acids vary greatly in their mutability; 55% of the
tryptophans, 52% of the cysteines and 27% of the glycines
would still be unchanged, but only 6% of the highly
mutable asparagines would remain. Several other amino acids,
particularly alanine, aspartic acid, glutamic acid, glycine, 
lysine, and serine are more likely to occur in place of an 
original asparagine than asparagine itself at this evolu- 
tionary distance! This is understandable from the data 
giving the preferred mutations and the relative mutabili- 
ties. Asparagine is highly mutable, therefore it changes to 
other amino acids. These are less mutable and may not 
change again. This effect is much more conspicuous in 
the case of methionine. Surprisingly, a methionine
originally present would have changed to leucine in 20% of
the cases, but would remain methionine in only 6%. Over
one-third of the mutations in methionine are specifically
to leucine (Figure 80). Leucine is less than one half as
mutable as methionine (Table 21).



From the series of distance-dependent mutation
probability matrices, we can compute detailed answers to the
question "How does the evolutionary process affect the 
similarity of related protein sequences?" 

Estimation of Evolutionary Distance 

There is a different mutation probability matrix for 
each evolutionary interval measured in PAMs. For each
such matrix, we can calculate the percentage of amino 
acids that will be observed to change on the average in 
the interval by the formula: 

$$100 \left(1 - \sum_i f_i M_{ii}\right)$$

Table 23 shows the correspondence between the observed
percent difference between two sequences and the evolu- 
tionary distance in PAMs. We use this scale to estimate 
Figure 83. Mutation probability matrix for the evolutionary
distance of 250 PAMs. To simplify the appearance, the elements
are shown multiplied by 100. In comparing two sequences of
average amino acid frequency at this evolutionary distance, there
is a 13% probability that a position containing Ala in the first
sequence will contain Ala in the second. There is a 3% chance
that it will contain Arg, and so forth. The relationship of two
sequences at a distance of 250 PAMs can be demonstrated by
statistical methods.
Table 23

Correspondence between Observed Differences
and the Evolutionary Distance

Observed Percent      Evolutionary Distance
Difference            in PAMs

        1                      1
        5                      5
       10                     11
       15                     17
       20                     23
       25                     30
       30                     38
       35                     47
       40                     56
       45                     67
       50                     80
       55                     94
       60                    112
       65                    133
       70                    159
       75                    195
       80                    246
       85                    328



evolutionary distances from matrices of percent difference 
between sequences. These estimated distances were used 
in the computations of evolutionary trees in this book. 
The differences predicted for a given PAM distance differ 
by up to 23% from those that we reported in Volume 5. 
A more complete scale is given in Table 36 of the
Appendix.
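
As an illustration of how such a scale can be applied, the sketch below stores the entries of Table 23 and interpolates linearly between them. Linear interpolation and the function name `estimate_pams` are assumptions made for this example; the text itself refers to the more complete scale of Table 36 for intermediate values.

```python
from bisect import bisect_left

# (observed percent difference, evolutionary distance in PAMs), from Table 23.
# The entry at 40% is read as 56, which lies between its neighbors 47 and 67.
TABLE_23 = [
    (1, 1), (5, 5), (10, 11), (15, 17), (20, 23), (25, 30),
    (30, 38), (35, 47), (40, 56), (45, 67), (50, 80), (55, 94),
    (60, 112), (65, 133), (70, 159), (75, 195), (80, 246), (85, 328),
]


def estimate_pams(percent_difference):
    """Estimate PAM distance by linear interpolation between table entries."""
    xs = [x for x, _ in TABLE_23]
    if not xs[0] <= percent_difference <= xs[-1]:
        raise ValueError("outside the range tabulated in Table 23")
    k = bisect_left(xs, percent_difference)
    if xs[k] == percent_difference:
        return float(TABLE_23[k][1])
    (x0, y0), (x1, y1) = TABLE_23[k - 1], TABLE_23[k]
    return y0 + (y1 - y0) * (percent_difference - x0) / (x1 - x0)


distance = estimate_pams(52.5)
```

The steep rise of the scale at large differences means that small errors in the observed percent difference translate into large uncertainties in the estimated distance there.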

Relatedness Odds Matrix

The elements, M_ij, of the mutation probability matrix
for each distance give the probability that amino acid j
will change to i in a related sequence in that interval. The
normalized frequency f_i gives the probability that i will
occur in the second sequence by chance.

The terms of the relatedness odds matrix are then:

$$R_{ij} = \frac{M_{ij}}{f_i}$$

The odds matrix is symmetrical. Each term gives the
probability of replacement per occurrence of i per occurrence
of j.



Amino acid pairs with scores above 1 replace each
other more often as alternatives in related sequences than
in random sequences of the same composition whereas
those with scores below 1 replace each other less often. 

The information in the 250-PAM odds matrix has 
proven very useful in detecting distant relationships be- 
tween sequences. When one protein is compared with 
another, position by position, one should multiply the 
odds for each position to calculate an odds for the whole 
protein. However, it is more convenient to add the loga- 
rithms of the matrix elements. The log of the 250-PAM 
odds matrix is shown in Figure 84.
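
A sketch of the odds and log odds computation for a toy three-letter alphabet follows. The matrix `M` and frequencies `f` are hypothetical, and the function names are assumptions. The scaling by 10 with base-10 logarithms follows the convention stated in the caption of Figure 84, where a score of -10 corresponds to one-tenth the chance frequency.

```python
import math


def odds_matrix(M, f):
    """Odds: probability that j is replaced by i, divided by the chance frequency of i."""
    n = len(f)
    return [[M[i][j] / f[i] for j in range(n)] for i in range(n)]


def log_odds_matrix(M, f, scale=10):
    """Base-10 log odds, multiplied by `scale` and rounded to integers."""
    return [
        [round(scale * math.log10(value)) for value in row]
        for row in odds_matrix(M, f)
    ]


# Hypothetical mutation probability matrix and matching composition.
M = [
    [0.992, 0.010, 0.005],
    [0.006, 0.986, 0.006],
    [0.002, 0.004, 0.989],
]
f = [0.5, 0.3, 0.2]

R = odds_matrix(M, f)
S = log_odds_matrix(M, f)
```

For this toy matrix `R` is symmetrical, as the text states for the real odds matrix, because `M` and `f` were chosen so that exchanges in the two directions balance.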

The Chemical Meaning of Amino 
Acid Mutations 

Patterns have been visible in the accepted point muta- 
tions since the beginning of protein sequence work. 
Isoleucine-valine and serine-threonine were frequently
observed alternatives. It was obvious that this interchange- 
ability had something to do with their chemical similari- 
ties. In the large amount of information that now exists, 
far more detailed correlations are visible, and many more 
functional inferences can be made. 

In the log odds matrix of Figure 84, the order of the 
amino acids has been rearranged to show clearly the 
groups of chemically similar amino acids that tend to 
replace one another: the hydrophobic group; the aromatic 
group; the basic group; the acid, acid-amide group; 
cysteine; and the other hydrophilic residues. Some groups
overlap: the basic and acid, acid-amide groups tend to
replace one another to some extent, and phenylalanine 
interchanges with the hydrophobic group more often than 
chance expectation would predict. These patterns are 
imposed principally by natural selection and only second- 
arily by the constraints of the genetic code: they reflect 
the similarity of the functions of the amino acid residues 
in their weak interactions with one another in the
three-dimensional conformation of proteins. Some of the
properties of an amino acid residue that determine these 
interactions are: size, shape, and local concentrations of 
electric charge; the conformation of its van der Waals 
surface; and its ability to form salt bonds, hydrophobic 
bonds, and hydrogen bonds. 

Computing Relationships between Sequences 

We use log odds matrices as scoring matrices for
detecting very distant relationships between proteins. Such
scoring matrices, based ultimately on accepted point
mutations, can discriminate significant relationships from
Figure 84. Log odds matrix for 250 PAMs. Elements are shown
multiplied by 10. The neutral score is zero. A score of -10 means
that the pair would be expected to occur only one-tenth as
frequently in related sequences as random chance would predict, and
a score of +2 means that the pair would be expected to occur 1.6
times as frequently. The order of the amino acids has been
arranged to illustrate the patterns in the mutation data.



random coincidences better than simpler scoring systems. 
Mere counts of identities and matrices based only on the 
changes predicted by the genetic code are not sufficiently 
complex. It is obvious that there is a good deal of
information in the detailed nature of both the nonidentities and
the identities. Certain combinations of different amino
acids are positive evidence of relatedness, and others are 
contraindications. The log odds matrix for 250 PAMs,
which we have found to be a very effective scoring matrix
for detecting distant relationships, is compared with other
matrices in chapter 23. 
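
To make the use of a scoring matrix concrete, the sketch below sums log odds scores position by position over two aligned sequences. The scores in `SCORES` are hypothetical values for a three-letter alphabet, not entries of Figure 84, and the function names are assumptions. Scores are taken to be ten times the base-10 log odds, as in the figure caption.

```python
# Hypothetical symmetric log odds scores (x10) for a toy alphabet.
SCORES = {
    ("A", "A"): 2, ("B", "B"): 4, ("D", "D"): 6,
    ("A", "B"): 0, ("A", "D"): -3, ("B", "D"): -5,
}


def pair_score(x, y):
    """Look up a score regardless of the order of the pair."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def alignment_score(seq1, seq2):
    """Sum of log odds scores over all aligned positions."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    return sum(pair_score(x, y) for x, y in zip(seq1, seq2))


def overall_odds(score, scale=10):
    """Convert a summed score back to odds for the whole comparison."""
    return 10 ** (score / scale)


identical = alignment_score("ABD", "ABD")
mixed = alignment_score("ABDA", "ABAD")
```

Adding the logarithms is equivalent to multiplying the odds for each position, so a summed score of zero corresponds to overall odds of 1, the neutral value.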